Han, Bo, Paul Cook and Timothy Baldwin (2013) unimelb: Spanish Text Normalisation, In Proceedings of the Tweet Normalization Workshop at SEPLN 2013 (Tweet-norm), Madrid, Spain, pp. 67-71

نویسندگان

  • Bo Han
  • Paul Cook
  • Timothy Baldwin
چکیده

This paper describes a lexicon-based text normalisation approach for Spanish tweets. We first compare English and Spanish text normalisation, and hypothesise that an approach previously proposed for English can be adapted to Spanish. A corpus-derived normalisation lexicon is built using distributional similarity, and is combined with existing lexicons (e.g., containing Spanish Internet slang). These lexicons enable a very fast, look-up based approach to text normalisation. Experimental results indicate that the corpus-derived lexicon complements existing lexicons, but that the approach could be improved through better handling of certain word types, such as named entities.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

unimelb: Spanish Text Normalisation

This paper describes a lexicon-based text normalisation approach for Spanish tweets. We first compare English and Spanish text normalisation, and hypothesise that an approach previously proposed for English can be adapted to Spanish. A corpus-derived normalisation lexicon is built using distributional similarity, and is combined with existing lexicons (e.g., containing Spanish Internet slang). ...

متن کامل

DLSI en Tweet-Norm 2013: Normalización de Tweets en Español

The lexical richness and its ease of access to large volumes of information converts the Web 2.0 into an important resource for Natural Language Processing. Nevertheless, the frequent presence of non-normative linguistic phenomena that can make any automatic processing challenging. In this paper is described the participation in the Text Normalisation Workshop at the SEPLN conference (Tweet-nor...

متن کامل

Lexical Normalization of Spanish Tweets with Preprocessing Rules, Domain-specific Edit Distances, and Language Models

We present a system to normalize Spanish tweets, which uses preprocessing rules, a domain-appropriate edit-distance model, and language models to select correction candidates based on context. The system’s results at SEPLN 2013 Tweet-Norm task were above-average.

متن کامل

Elhuyar at Tweet-Norm 2013

This paper presents the system developed by Elhuyar for the TweetNorm evaluation campaign which consists of normalizing Spanish tweets to standard language. The normalization covers only the correction of certain Out Of Vocabulary (OOV) words, previously identified by the organizers. The developed system follows a two step strategy. First, candidates for each OOV word are generated by means of ...

متن کامل

The TALP-UPC Approach to Tweet-Norm 2013

This paper describes the methodology used by the TALP-UPC team for the SEPLN 2013 shared task of tweet normalization (Tweet-Norm). The system uses a set of modules that propose different corrections for each out-of-vocabulary word. The final correction is chosen by weighted voting according to each module accuracy.

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013